## [1] "Cancelled" "Chargedoff"
## [3] "Completed" "Current"
## [5] "Defaulted" "FinalPaymentInProgress"
## [7] "Past Due (>120 days)" "Past Due (1-15 days)"
## [9] "Past Due (16-30 days)" "Past Due (31-60 days)"
## [11] "Past Due (61-90 days)" "Past Due (91-120 days)"
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
50% of loans, 56576 are Current. Interested in knowing how many are “Active” and how many are “Closed”.
~51.6% of loans are active, meaning they are not Charged Off, Cancelled, Defaulted, and Completed.
For a typical closed loan, how long from Origination to Closed?
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -3.79 0.83 1.44 1.62 2.59 4.27 58848
## ListingKey ListingCreationDate
## 108298 DEAA359893047281162F432 2013-12-27 12:02:50.107000000
## 112976 E77E34330098056070F5C30 2008-10-07 17:09:34.653000000
## ClosedDate LoanOriginationDate AgeYears
## 108298 2010-03-16 00:00:00 2013-12-31 00:00:00 -3.793664384
## 112976 2008-10-16 00:00:00 2008-10-17 00:00:00 -0.002739726
Average age in years from Origination to Closed is 1.62, Median Age is 1.44. We can see from the plot that the distribution is multi-modal and slightly left skewed. Quick note: 2 loans have ages below 0. This is not an issue with the age calculation. Indeed, in these case you can see that the LoanOriginationDate > ClosedDate. This may be an instance of some messy data We ignore these 2 loans and re-summarize this variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.8272 1.4420 1.6230 2.5880 4.2690
It would be interesting to know how these ages vary for different loan statuses (for example, how quickly are loans cancelled?!), and different loan terms. In fact, let’s examine the loan counts by the term.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 40.83 36.00 60.00
##
## 12 36 60
## 0.01416572 0.77040821 0.21542607
Another interesting set of dates to compare is ListingCreationDate and LoanOriginationDate. This captures the time it takes from a loan to be listed to receiving the loan from prosper.
## [1] "Average Days"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.405 4.594 8.175 12.520 12.600 1094.000
## 20% 40% 60% 80% 90% 95%
## 4.044384 6.610001 9.591654 14.103341 20.250690 37.696425
90% of loans are originated within 21 days of listing. The median borrower is waiting about 1 week and 1 day to get their funds. Kind of confused about how cancellations mix with this. Seems like the loans in this data have all been originated. So, we would not have data about loans that were cancelled before origination (all the loans in this dataset have been originated). Therefore, it’s hard to say if the wait time to origination is correlated with pre-origination cancellations – this is obviously data that Prosper would find to be too sensitive to share. It would be interesting to see the relationship between the duration from listing to origination, and the likelihood of cancellation (after origination).
Moving on, let’s see what are the major listing categories for borrowers who’s loans were originated.
##
## Not Available Debt Consolidation Home Improvement
## 16965 58308 7433
## Business Personal Loan Student Use
## 7189 2395 756
## Auto Other Baby&Adoption
## 2572 10494 199
## Boat Cosmetic Procedures Engagement Ring
## 85 91 217
## Green Loans Household Expenses Large Purchases
## 59 1996 876
## Medical/Dental Motorcycle RV
## 1522 304 52
## Taxes Vacation Wedding Loans
## 885 768 771
##
## Not Available Debt Consolidation Home Improvement
## 0.1488980753 0.5117564970 0.0652378069
## Business Personal Loan Student Use
## 0.0630962725 0.0210203885 0.0066352458
## Auto Other Baby&Adoption
## 0.0225738785 0.0921035309 0.0017465792
## Boat Cosmetic Procedures Engagement Ring
## 0.0007460263 0.0007986870 0.0019045613
## Green Loans Household Expenses Large Purchases
## 0.0005178300 0.0175184532 0.0076884594
## Medical/Dental Motorcycle RV
## 0.0133582594 0.0026681412 0.0004563926
## Taxes Vacation Wedding Loans
## 0.0077674504 0.0067405672 0.0067668975
For nearly 24% of loans, a listing category is Not Available (15%) or Other (9%). Top three categories besides these are:
I am surpised that Home Improvement is so high on this list - this would seem more like a luxury loan. But perhaps the borrowers that apply for this loan show credit/income conducive to being able to pay this off easily. Finally, it’s pretty interesting to see categories like Taxes/Vacation/Wedding Loans/Engagement Ring (although they only account for <4% of originated loans)
Let’s see a distribution of the loan sizes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
The minimum loan size is $1,000 and the maximum loan origination amount is $35,000. Seems like the most popular loan sizes in descending order are: - $5,000 - $10,000/$15,000 - $20,000/$25,000
75% of loans are less than $12,000. It would be interesting to see the mean/median loan size for each listing category. For example, I would expect the mean/median business loans to be larger than the mean/median debt consolidation.
Let’s see where our borrowers are usually from.
It’s easy to see that California is the home state for a majority of borrowers. - 5% (5500) of loans have no BorrowerState listed - Top 4 states: California, then Texas/New York/Florida. - Top 4 states account for 31% of all loans
Let’s see some more borrower information: Occupation, EmploymentStatus, EmploymentStatusDuration, IsBorrowerHomeowner
##
## Accountant/CPA
## 0.0314910872 0.0283753302
## Administrative Assistant Analyst
## 0.0323687652 0.0316139621
## Architect Attorney
## 0.0018694542 0.0091805120
## Biologist Bus Driver
## 0.0010970975 0.0027734625
## Car Dealer Chemist
## 0.0015798204 0.0012726331
## Civil Service Clergy
## 0.0127877687 0.0017202489
## Clerical Computer Programmer
## 0.0277697324 0.0393024215
## Construction Dentist
## 0.0157104365 0.0005968211
## Doctor Engineer - Chemical
## 0.0043357294 0.0019747755
## Engineer - Electrical Engineer - Mechanical
## 0.0098738777 0.0123401529
## Executive Fireman
## 0.0378366992 0.0037038012
## Flight Attendant Food Service
## 0.0010795440 0.0098563241
## Food Service Management Homemaker
## 0.0108744306 0.0010532136
## Investor Judge
## 0.0018782310 0.0001930892
## Laborer Landscaping
## 0.0139989643 0.0020713201
## Medical Technician Military Enlisted
## 0.0098036634 0.0111640644
## Military Officer Nurse (LPN)
## 0.0030367659 0.0043181758
## Nurse (RN) Nurse's Aide
## 0.0218454058 0.0043093991
## Other Pharmacist
## 0.2511651176 0.0022556325
## Pilot - Private/Commercial Police Officer/Correction Officer
## 0.0017465792 0.0138497591
## Postal Service Principal
## 0.0055030412 0.0027383554
## Professional Professor
## 0.1196099599 0.0048886665
## Psychologist Realtor
## 0.0012726331 0.0047657916
## Religious Retail Management
## 0.0010883207 0.0228371820
## Sales - Commission Sales - Retail
## 0.0302447844 0.0245486541
## Scientist Skilled Labor
## 0.0032649622 0.0241010383
## Social Worker Student - College Freshman
## 0.0065035941 0.0003598480
## Student - College Graduate Student Student - College Junior
## 0.0021503111 0.0009829994
## Student - College Senior Student - College Sophomore
## 0.0016500347 0.0006055978
## Student - Community College Student - Technical School
## 0.0002457498 0.0001404285
## Teacher Teacher's Aide
## 0.0329919166 0.0024223913
## Tradesman - Carpenter Tradesman - Electrician
## 0.0010532136 0.0041865241
## Tradesman - Mechanic Tradesman - Plumber
## 0.0083467179 0.0008952316
## Truck Driver Waiter/Waitress
## 0.0147011068 0.0038266761
##
## Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
##
## Employed Full-time Not available Not employed
## 0.019791639 0.590870393 0.231312041 0.046929443 0.007328611
## Other Part-time Retired Self-employed
## 0.033404425 0.009549137 0.006977540 0.053836769
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.07 137.00 755.00 7625
##
## False True
## 0.4955282 0.5044718
40% of borrowers are Other (25%) or Professional (12%) or Blank (3%). I would expect a little more clarity in this data, as this could impact willingness of investors to fund your loan. After these occupations, the top five are (each between 3% and 4% of total loans): - Computer Programmer - Executive - Teacher - Administrative Assisstant - Analyst
It’s interesting that Prosper has 7 different categories for Students: [Type | Percentage of total loans]
College Juniors/Seniors/Graduate students to be the highest count of loan originations. Freshman/Sophomore students, community college and technical school students have a smaller percentage of the loans. With the current data, we cannot say if this is because of the number of applications, or Prospers loan acceptance process. However, we can check later about which occupations have the best/worst default rates.
Most borrowers are Employed (60%) or Full-Time (23%). <1% for each of the following Not employed/Part-Time/Retired. This is consistent with what I expect.
Borrowers for >7000 loans have been employed for less than 6 months in their current job. So, although it seems many loans don’t go to students, borrowers that are relatively new (<6 months) to their job constitute most of the loans. This is a right skewed distribution.
49.5% of borrowers are homeowners. I would guess that disproportionately large percentage of borrowers doing home improvement are home owners (can check this later)!
let’s take a look at a borrowers credit score and utilization
##
## 0 _ 19 360 _ 379 420 _ 439 440 _ 459 460 _ 479
## 0.116731176 0.000877678 0.004388390 0.031596409 0.123752600
## 480 _ 499 500 _ 519 520 _ 539 540 _ 559 560 _ 579
## 0.303676593 0.486233620 1.398141078 1.293697394 1.191009066
## 580 _ 599 600 _ 619 620 _ 639 640 _ 659 660 _ 679
## 0.987387767 3.161396210 3.661672679 10.706794106 14.364078394
## 680 _ 699 700 _ 719 720 _ 739 740 _ 759 760 _ 779
## 14.474665824 13.578556571 11.342232988 8.133442165 5.797940967
## 780 _ 799 800 _ 819 820 _ 839 840 _ 859 860 _ 879
## 4.058383142 2.320580672 1.236648323 0.497643435 0.186067739
## 880 _ 899 NA _ NA
## 0.023697306 0.518707707
## [1] 0.06456199
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.310 0.600 0.561 0.840 5.950 7604
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3121 8549 17600 19520 1436000 7604
Most originated loans go to borrowers with credit scores between 640 and 740. 660 - 700 range constitutes about 28% of loans. 6% of loans are to borrowers with credit below 600 or no credit rating at all.
50% of borrowers have a utilization between .3 and .84. Median BankcardUtilization is around .6. That means, at the time the credit profile was pulled, around 60% of available credit was being used by the median borrower. 7% (7604) of loans had a value of NA for BankcardUtilization/RevolvingCreditBalance. A possible explanation for this is that 7% of originated loans are for borrowers with no credit card at the time of applying for the loan. It’s a very interesting distribution, as there are peaks around 0, and 1, and a long tail after 1 (for people who’s utilization is more than their available credit).
I wonder how much the typical prosper borrower earns?
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
##
## False True
## 0.07608591 0.92391409
Most borrowers (56%) in the $25k-$75k annual income. And, 92% of borrowers can prove it. If i’m an investor, its reassuring that the borrower can validate their income source! For those borrowers who’s income is Not Displayed (7%)/Not Employed (<1%), I’m surprised they were approved for a loan and that they got funded.
Speaking of which, since all of these loans are originated, that means they all got full funded, right?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7000 1.0000 1.0000 0.9986 1.0000 1.0120
## [1] 0.0007460263
.07% of loans were not fully funded – But, seems like every originated loan is at least 70% funded. So, maybe the platform covers the rest? Not sure how this works.
Back to borrower features: what’s the typical DebtToIncome ratio for a borrower?
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
Heavily right skewed distribution with a value of 10.01 representing a ratio >= 1000%.
We may try a log10 transformation of this feature. Since this feature can = 0, we will do a log10 transform on the feature + 1.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
We can see the full data on this scale - most peopl fall have a DebtToIncomeRatio between .1 and 1 Median/Mean is .22/.276 and 50% of borrowers are between .14 and .32.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4743 0.0000 0.0000 0.2187 0.4823 1.0000
## [1] 69
Mean unweighted loss for closed loans is around 22%. Some times (69 loans, ~.06% of loans), the loss is negative. This occurs when LP_NetPrincipalLoss <0, which occurs mostly for loans from 2007/2008/2009,and may indicate overrepayment from the borrower.
There are many more variables to explore – we can consider some of them in the Bivariate/Multivariate sections
Dataset has 113,937 observations of 82 variables. Each observation is of a loan on the prosper platform originated between late 2005 and early 2014. There are many types variables in this dataset: Datetime, Categorical (ordered, un-ordered), Binary, and Numerical.
Datetime: ListingCreationDate, LoanOriginationDate, ClosedDate
Categorical: ProsperRating(Alpha) - ordered categorical variable, LoanStatus/ListingCategory/BorrowerState/Occupation/EmploymentStatus/…
Binary: IsBorrowerHomeowner, IncomeVerifiable
Numerical: LoanOriginationAmount/BankcardUtilication/RevolvingCreditBalance
My main feature of interest is the Loss Rate; LP_NetPrincipalLoss/LoanOriginalAmount. This feature is of interest in terms of understanding how well Prosper is doing in terms of minimizing chargedoff/defaulted loans and how much loss prosper has suffered from these outcomes.
I think we would need to consider at least the following features.
About the Loan:
About the Borrower:
About the lending:
There are probably more features in this dataset that are relevant, but we may consider just a subset of these for simplicity.
Yes:
Yes:
Let’s start out with the 2 features of interest – Prosper’s Score and the LoanStatus.
##
## AA A B C D E HR
## 5372 14551 15581 18345 14274 9795 6935 29084
Okay, the bottom right plot shows that there’s a bunch of loans with no Proser Rating. So, prosper introduced this score July 2009. Therefore, 29084 loans have no “prosper rating”. We may see the CreditGrade for the rating that these scores were given.
##
## AA A B C D E
## 0.030797722 0.029095026 0.038521288 0.049580031 0.045226748 0.028866830
## HR NC
## 0.030788945 0.001237526 0.745885884
## Source: local data frame [9 x 4]
##
## CreditGrade lower upper count
## (fctr) (int) (int) (int)
## 1 AA 760 899 3509
## 2 A 720 759 3315
## 3 B 680 719 4389
## 4 C 640 679 5649
## 5 D 600 639 5153
## 6 E 540 599 3289
## 7 HR 0 559 3508
## 8 NC 0 19 141
## 9 600 899 84984
two interesting things to note here:
So, for borrowers with a credit score range (not NA_NA), you can see a clear delineation between CreditGrades based on the CreditScore. This correlation definitely gives a lot of evidence to indicate that up until July 2009, Prosper graded loan applications almost exclusively based on the CreditScore range.
Let’s go a step further, how did the previous CreditGrade system fair in terms of defaults?
## Source: local data frame [8 x 5]
##
## CreditGrade total cancelled completed chargeoff_or_default
## (fctr) (int) (dbl) (dbl) (dbl)
## 1 AA 3509 0 0.85 0.15
## 2 A 3315 0 0.76 0.24
## 3 B 4389 0 0.68 0.32
## 4 C 5649 0 0.64 0.36
## 5 D 5153 0 0.61 0.39
## 6 E 3289 0 0.51 0.49
## 7 HR 3508 0 0.39 0.61
## 8 NC 141 0 0.30 0.70
The distribution of completions/defaults across CreditGrade look consistent with what an investor would expect. Nevertheless, having a AA loans default around 12-17% of the time is not ideal. Perhaps this number was lowered after the introduction of the Prosper Rating.
Okay, so how has the new Prosper Rating done in terms of mitigating defaults for different grades? (We will only consider closed loans). Predicting whether an open loan will be completed/defaulted is a whole other problem itself, and we don’t have the right data for it.
## Source: local data frame [7 x 5]
##
## ProsperRating..Alpha. total cancelled completed chargeoff_or_default
## (fctr) (int) (dbl) (dbl) (dbl)
## 1 AA 1783 0 0.95 0.05
## 2 A 3608 0 0.89 0.11
## 3 B 3373 0 0.83 0.17
## 4 C 3817 0 0.78 0.22
## 5 D 5869 0 0.71 0.29
## 6 E 3830 0 0.65 0.35
## 7 HR 3725 0 0.62 0.38
Defaults have decreased across all categories of ratings. So, i think the new rating system is performing better than the old one. Perhaps Prosper improved their underwriting process with data collected from their historical loans.
Default rate is just part of it. If 1% of loans default, but those 1% are in the 99th percentile in terms of loan originations, then prosper could still have lost as much money as in the case when a larger percentage of small to midsized loans default. So, let’s take a look at the LP_NetPrincipalLoss per loan for different grades. We do “per loan” to account for the fact that are many more closed loans pre july 2009, so we want the per loan loss, rather than the absolute loss.
## Source: local data frame [2 x 6]
##
## isPreJuly2009Loan total total_net_loss total_originated loss_per_loan
## (lgl) (int) (dbl) (int) (dbl)
## 1 FALSE 26005 31202559 165801431 1199.868
## 2 TRUE 29084 46436448 179137169 1596.632
## Variables not shown: loss_rate (dbl)
So, it’s clear that, for all closed loans, between Pre/Post July 2009, there’s an improvement of:
Let’s continue with this loss_per_loan and loss_rate metric, and see how it varies across different loan types. Which types are more unreliable than others?
Loss Rate Below 10%: RV, Motorcycle, Engagement Ring, Boat Loss Rate Above 30%: Household Expenses Caveats: these categories account for very little of the total originations.
Debt Consolidation/Not Available/Business Loans/etc account for more of the loan originations, and their loss rates are between 18-30%.
Is credit score a good proxy for understanding who will default? let’s take a look at the loss rate by the credit score
Wow, it’s almost linear. I would’ve expected there to be more variation. So, what would be interesting to see is if the Post July 2009 loans (for which credit grades are assigned using Prospers own metric) are not as correlated with the credit score. This would indicate that prosper has found other data to properly assess risk. Quick aside: What’s the difference between Borrower APR and Borrower Rate?
##
## Call:
## lm(formula = BorrowerAPR ~ BorrowerRate, data = subset(df, CreditGrade !=
## ""))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.014134 -0.004627 -0.002311 0.004586 0.045466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.504e-03 9.596e-05 67.77 <2e-16 ***
## BorrowerRate 1.033e+00 4.847e-04 2131.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00614 on 28926 degrees of freedom
## (25 observations deleted due to missingness)
## Multiple R-squared: 0.9937, Adjusted R-squared: 0.9937
## F-statistic: 4.542e+06 on 1 and 28926 DF, p-value: < 2.2e-16
Borrower APR is basically a linear function of Borrower Rate. See the R^2 of .99
APR includes fees that Prosper collects, while Rate excludes these fees. From the regression, we can see that the fees Prosper collects increase linearly with respect to increases in Borrower Rates. Tony Soprano probably wasn’t this nice :)
Back to loss rate. How about examining this by: Borrower State/IsBorrowerHomeowner/IncomeRange/DebtToIncomeRatio
Wyoming has an exceedingly low loss rate - we can see that their total originated loan amount is quite low. California, which originates the highest volumn of loans has a loss rate that is slightly less than 25%.
IsBorrowerHomeowner/IncomeRange
Loss rate doesn’t differ too much across IsHomeownerBorrower.
Loss rate falls as income increases - that is to be expected.
Debt to Income Ratio vs Loss Rate
##
## Call:
## lm(formula = LP_NetPrincipalLoss/LoanOriginalAmount ~ log10(DebtToIncomeRatio +
## 1), data = subset(df, ClosedDate != ""))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6714 -0.2144 -0.2027 0.2362 0.8114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.187328 0.002259 82.94 <2e-16 ***
## log10(DebtToIncomeRatio + 1) 0.290199 0.017370 16.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3549 on 50857 degrees of freedom
## (4230 observations deleted due to missingness)
## Multiple R-squared: 0.005459, Adjusted R-squared: 0.005439
## F-statistic: 279.1 on 1 and 50857 DF, p-value: < 2.2e-16
From the scatter plots, it seems like there is some positive correlation for high enough values of log10(DebtToIncomeRatio+1) and the loss rate. But overall, this explanatory variable does not explain much of the variation in the loss rate. In fact, based on the R^2 value, the DebtToIncomeRatio explains about .5% of the variation in the LP_NetPrincipalLoss/LoanOriginalAmount. That’s extremely low.
What about the Loss Rate across: Loan Size, Term, BorrowerAPR
## [1] -0.1174785
Pre July 2009, CreditGrade was a direct function of CreditRange. Post July 2009, it is a bit more complex to figure out what constitutes the Prosper Rating.
Loss Rate was examined against many different features.
Yes:
Let’s dig further into the Loss Rate’s relationship across listing categories weighted by the total volume of loans originated from the listing.
The first bubble chart shows us the Loss Rate across listings with the bubbles sized by the log10 of the total volume of loans originated for that listing. We can see that the largest bubbles typically have loss rates in the 15-30% range. The categories with low loss rates have small bubbles, indicating a very small proportion of originated loans are from listings in these categories.
We break this chart up into 3, 1 for each Term size for a loan. It’s quite clear that regardless of the listing category, the loss rate seems to increase as the loan term increased. We can see that the loss rates for some categories such as Home Improvement have dramatic leaps as we go from 12->36->60 month loans. Debt Consolidation, the largest “Available” category by loan volume originated, has a Loss Rate that goes from 3% (12 month) -> 18% (36 month) -> 26% (60 month). This trajectory, particularily the increase from 36 month to 60 month, is actually better than the trajectories for Home Improvement/Household Expenses, and many other categories.
Let’s consider the loss rate in terms of employment status
Once again, independent of employment status, loss rate tends to be higher with longer term loans.
Let’s consider the loss rate in terms of the term and income range
For high incomes ($100,000+), the loss rate seems to level out at 36 months at ~22.5%. So, in terms of just the loss rate, Prosper could be indifferent between giving a borrower with an income of $100,000 a loan that lasts 36 months/60 months. This information is important, as prosper can adjust it’s sales/fees approach to make the loan product that will be most suitable to ‘high-income borrowers’.
loss rate across term and credit range
Loss rate increases across terms and decreases within a term across credit range.
Loss rate across credit range and income range
Loss rate increases across income ranges, and decreases within an income range across credit ranges.
Finally, let’s try a simple model to predict loss rate
##
## Call:
## lm(formula = LossRate ~ BorrowerRate + Term + CreditRange + IncomeRange,
## data = subset(df, ClosedDate != ""))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9594 -0.2356 -0.1197 0.2184 1.0600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2095849 0.0343066 6.109 1.01e-09 ***
## BorrowerRate 0.9874090 0.0214048 46.130 < 2e-16 ***
## Term 0.0041755 0.0001995 20.930 < 2e-16 ***
## CreditRange360 _ 379 -0.0411836 0.3390156 -0.121 0.90331
## CreditRange420 _ 439 0.1632677 0.1538580 1.061 0.28862
## CreditRange440 _ 459 0.1339846 0.0634552 2.111 0.03474 *
## CreditRange460 _ 479 0.0712817 0.0408267 1.746 0.08082 .
## CreditRange480 _ 499 0.0210634 0.0344599 0.611 0.54104
## CreditRange500 _ 519 -0.0700307 0.0326144 -2.147 0.03178 *
## CreditRange520 _ 539 -0.1380325 0.0306904 -4.498 6.89e-06 ***
## CreditRange540 _ 559 -0.1901940 0.0306766 -6.200 5.69e-10 ***
## CreditRange560 _ 579 -0.2245286 0.0308943 -7.268 3.71e-13 ***
## CreditRange580 _ 599 -0.2571038 0.0311834 -8.245 < 2e-16 ***
## CreditRange600 _ 619 -0.2622299 0.0301587 -8.695 < 2e-16 ***
## CreditRange620 _ 639 -0.2846251 0.0301350 -9.445 < 2e-16 ***
## CreditRange640 _ 659 -0.2899303 0.0299322 -9.686 < 2e-16 ***
## CreditRange660 _ 679 -0.3120850 0.0299634 -10.416 < 2e-16 ***
## CreditRange680 _ 699 -0.3070875 0.0299682 -10.247 < 2e-16 ***
## CreditRange700 _ 719 -0.3074518 0.0300353 -10.236 < 2e-16 ***
## CreditRange720 _ 739 -0.3098907 0.0300924 -10.298 < 2e-16 ***
## CreditRange740 _ 759 -0.3182031 0.0302173 -10.530 < 2e-16 ***
## CreditRange760 _ 779 -0.3122257 0.0303663 -10.282 < 2e-16 ***
## CreditRange780 _ 799 -0.3266017 0.0306216 -10.666 < 2e-16 ***
## CreditRange800 _ 819 -0.3312094 0.0311465 -10.634 < 2e-16 ***
## CreditRange820 _ 839 -0.3428484 0.0320105 -10.711 < 2e-16 ***
## CreditRange840 _ 859 -0.3504755 0.0345278 -10.151 < 2e-16 ***
## CreditRange860 _ 879 -0.3406053 0.0394299 -8.638 < 2e-16 ***
## CreditRange880 _ 899 -0.3691003 0.0779624 -4.734 2.20e-06 ***
## CreditRangeNA _ NA -0.2814501 0.0324391 -8.676 < 2e-16 ***
## IncomeRange$0 0.0145959 0.0200974 0.726 0.46768
## IncomeRange$1-24,999 -0.0327570 0.0155077 -2.112 0.03466 *
## IncomeRange$25,000-49,999 -0.0417113 0.0149126 -2.797 0.00516 **
## IncomeRange$50,000-74,999 -0.0716346 0.0149735 -4.784 1.72e-06 ***
## IncomeRange$75,000-99,999 -0.0823963 0.0152675 -5.397 6.81e-08 ***
## IncomeRange$100,000+ -0.0824090 0.0153164 -5.380 7.46e-08 ***
## IncomeRangeNot displayed -0.0364875 0.0154771 -2.358 0.01840 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3377 on 55053 degrees of freedom
## Multiple R-squared: 0.1159, Adjusted R-squared: 0.1154
## F-statistic: 206.2 on 35 and 55053 DF, p-value: < 2.2e-16
Using the BorrwowerRate, Term, CreditRange, and IncomeRange to predict the unweighted loss rate for closed loans, we can see that 11.59% of the variation in the unweighted loss rate is explained by these factors. There is many more columns yet to be explored and subsequent analysis it would worthwhile to consider these factors and see if they can improve the predictive power of this model.
Will take a detour here and look how much Prosper has changed in various states over time in terms of originations and loss_rate
Prosper’s originations started primarily in its home state of California, and then experienced success in Texas, and many states along the East Coast. Since this data is last updated midway through 2014, the intensity of the 2014 heat map is less than that of 2013 – but, i suspect that by the close of 2014, there is probably more penetration by Prosper in the various parts of the US market.
In this part of the investigation, I looked at how Loss Rate and Loan Originations vary across Listing Categories, Terms, Employment Statueses, IncomeRange and CreditRange. Also, I considered how Prosper’s loan originations have grown across the continental US by year since inception.
In the multivariate analysis, i could not find features that strengthened each other. Relationships such as Term and Listing Category did not strengthen each other when examined against the Loss Rate. Terms/IncomeRange had some interesting interactions (described in the next section), but again, the features did not seem to strengthen each other.
Yes - For high income borrowers, 36 month and 60 month loss rates were the same. The expectation was that 60 month loss rate is higher than 36 month loss rate regardless of the Income.
I created a simple model to predict the unweighted LossRate for a loan given the:
The model could only explain 11.59% of the variation in the LossRate. There are clearly many latent variables that have not been considered and incorporated into the model. There are some potentially useful features however, like IncomeRange, Term, and BorrowerRate already in the model. So, perhaps this serves as a good foundation to build a more comprehensive linear regression.
Change in Loss Rate by CreditGrade/ProsperRating Pre/Post July 2009. We can see that the Loss Rate improves in every grade after the loan grading system changes from the CreditGrade to the ProsperRating.
Loss Rate falls for all groups. Variation in the loss rate across Income Ranges increases as Term size increases. Finally, for high-income borrowers (>$100,000) of 36/60 month loans, the loss rate is nearly the same.
Excluding 2014 (do not have a full year’s worth of data), prosper loans originations seem to increase across the US states over time. We see that as of 2013, there are no loans in North Dakota, Iowa and Maine. In the past, take 2008 for instance, there were loan originations in these states. In fact, according to this source (http://www.lendingmemo.com/lending-club-and-prosper-states/), 3 states are not open to p2p lending: North Dakota, Iowa and Maine. So, legislation has actually closed these states off from certain lending marketplaces. Aside from this, we can see that California, Texas, New York and Florida are the hotbeds of loan originations. It’s no coincidence that these states also have the highest electoral counts. This is probably more so a testament to the large populations of these states, rather than the level of “p2p lending interest” in these regions (compared to others).
This dataset has 81 variables. This makes it very challenging to do EDA on it without having much domain knowledge about this space. It was difficult to get started with so many variables, but the key to moving forward was having a business scenario. Focusing on the loss rate as a metric provided a fixed point and helped guide the analysis. I had a hypothesis that various features would be predictive of the loss rate, and built a linear model with these features, but the outcome showed that the chosen features did not explain much of the variance in the loss rate. Given more time, it would prudent to look deeper into other features of these dataset; especially those describing more detailed info about the borrowers credit history: open lines of credit, inquires, closed lines, etc. However, I was able to garner some insights and generate 3 final charts that illustrate them.